Ethnologue maintains a database of resources on the world's languages and hosts over 7,000 language profiles online. The data were mined from its open website to compile a list of language profiles. We analyse two properties of each language: its usage and its written form (written vs unwritten).
We took the languages that have a description of their language usage and grouped them by language status, as defined in the Expanded Graded Intergenerational Disruption Scale (EGIDS). We then compiled a dictionary listing the most used words, after cleaning the text of English stop words, numbers and punctuation.
We then filtered out high-frequency terms that carry no significant meaning; the frequency cut-off is set at 1400.
topfeatures(lang.use.dfm, 20)
## use also ages used attitudes positive
## 4983 4539 1727 1482 1373 1205
## l2 vigorous domains home english eng
## 1198 915 878 858 790 768
## language children adults older spanish spa
## 626 625 578 497 491 465
## especially speakers
## 423 362
lang.use.dfm<-dfm_trim(lang.use.dfm, max_count=1400)
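The effect of the frequency cut-off can be sketched in base R on a toy vector of term counts. The words, counts and the `cutoff` value below are invented for illustration and are not taken from the Ethnologue data; the sketch only assumes that trimming keeps terms whose total count does not exceed the threshold.

```r
# Toy term counts; words and numbers are invented for illustration only.
term_counts <- c(use = 50, also = 45, attitudes = 12, vigorous = 9, home = 8)

# Hypothetical cut-off, playing the role of the 1400 threshold used above.
cutoff <- 20

# Keep only the terms whose total count does not exceed the cut-off.
kept <- term_counts[term_counts <= cutoff]
print(names(kept))
```

Very frequent but uninformative terms (here `use` and `also`) drop out, while the more descriptive terms remain.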
We then compute the similarity in language usage between EGIDS levels using the extended Jaccard index, and apply hierarchical clustering with Ward's method to see whether any EGIDS levels are similar enough to be clustered together.
Using the words that describe how the languages are used, we can see similarity between EGIDS levels that are adjacent to each other. The clusters are very close to the original EGIDS grouping, where 0-4 are Institutional, 5 is Developing, 6a is Vigorous (usage), 6b-7 are In Trouble, 8a-9 are Dying and 10 is Extinct.
lang.use.dist<-textstat_simil(lang.use.dfm, method="eJaccard")
corrplot(as.matrix(lang.use.dist), hclust.method = "ward.D2", order="hclust", addrect=4)
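To make the similarity and clustering steps concrete, here is a minimal base R sketch on toy counts. It assumes the extended Jaccard index takes its usual (Tanimoto) form, sum(x*y) / (sum(x^2) + sum(y^2) - sum(x*y)); the `ejaccard` helper, the `status_*` rows and their counts are hypothetical and only illustrate the idea.

```r
# Extended Jaccard (Tanimoto) similarity between two numeric count vectors.
ejaccard <- function(x, y) {
  xy <- sum(x * y)
  xy / (sum(x^2) + sum(y^2) - xy)
}

# Toy term counts per status level (invented values).
counts <- rbind(
  status_a = c(l2 = 3, attitudes = 2, shifted = 0),
  status_b = c(l2 = 2, attitudes = 2, shifted = 1),
  status_c = c(l2 = 0, attitudes = 1, shifted = 4)
)

# Pairwise similarity matrix between the rows.
n <- nrow(counts)
sim <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- ejaccard(counts[i, ], counts[j, ])
  }
}

# Turn similarity into a distance (1 - similarity) and cluster with Ward's method.
toy_h <- hclust(as.dist(1 - sim), method = "ward.D2")
toy_groups <- cutree(toy_h, k = 2)

print(round(sim, 3))
print(toy_groups)
```

Rows with similar word profiles (`status_a` and `status_b`) end up in the same group, while the row dominated by `shifted` is separated.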
Dendrogram showing the hierarchical cluster.
h<-hclust(as.dist(1-as.matrix(lang.use.dist)), method="ward.D2")
plot(h)
abline(h=1.0, lty=2)
group<-cutree(h, k=4)
topStatusFeatures<-sapply(seq(1,max(group)), function(x) {
topfeatures(dfm_select(lang.use.dfm, documents=names(group[which(group==x)])))
}, simplify = F)
names(topStatusFeatures) <- sapply(seq(1,max(group)), function(x) {paste(sort(names(group[which(group==x)])), collapse=",")})
topStatusFeatures[c(4,1,2,3)]
## $`1,2,3,4`
## l2 attitudes positive domains english eng vigorous
## 246 177 167 142 122 119 111
## language home european
## 68 57 52
##
## $`5,6a,6b`
## attitudes positive l2 vigorous home domains english
## 1085 976 877 801 688 685 446
## children eng language
## 439 433 366
##
## $`7,8a,8b`
## adults older english eng shifting wurm speakers children
## 332 332 187 182 176 166 149 147
## language mainly
## 146 144
##
## $`10,9`
## shifted language english eng portuguese
## 116 46 35 34 25
## por revitalization golla speakers speak
## 23 17 16 15 14
For each EGIDS cluster, we list the top words. Languages in EGIDS 1-4 are often used as a second language (L2), and their speakers generally have positive attitudes towards them. They are also used in some or all domains of their society.
For EGIDS 5, 6a and 6b, speakers have positive attitudes, but use as a second language is less prominent. These languages are still used vigorously, but may be limited to certain domains or to the home.
For EGIDS 7, 8a and 8b, the languages are used mainly by adults or the older generation; speakers also appear to know English and are shifting to other languages.
Lastly, for EGIDS 9 and 10, speakers have shifted to another language, probably English or Portuguese.
We next look at the written form of each language, categorising languages that have some writing script as “Written” and the rest as “Unwritten”, as declared in the qualitative variables of the language profile.
We analyse the ratio of written to unwritten languages for the clusters obtained in the first section.
kable(data.ele)
| | Unwritten | Written | Total |
|---|---|---|---|
| 1,2,3,4 | 7 | 474 | 481 |
| 5,6a,6b | 653 | 2432 | 3085 |
| 7,8a,8b | 257 | 281 | 538 |
| 10,9 | 48 | 22 | 70 |
plotPie()
The pie charts show that widely used languages (EGIDS 1-4 and EGIDS 5, 6a and 6b) mostly have some written form (more than 75% of languages in each cluster). By contrast, the clusters in trouble, dying or extinct have a much higher share of languages without any written form (about 48% for EGIDS 7, 8a and 8b, and about 69% for EGIDS 9 and 10).
This may indicate that well-adopted languages are more developed, with a written form that can be adopted institutionally and transmitted widely.
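The shares behind this comparison can be recomputed directly from the table. The sketch below re-enters the published counts by hand (the `writing_counts` and `writing_pct` names are introduced here for illustration) and uses base R's `prop.table` to get row-wise percentages.

```r
# Counts copied from the table above; rows are the EGIDS clusters.
writing_counts <- rbind(
  "1,2,3,4" = c(Unwritten = 7,   Written = 474),
  "5,6a,6b" = c(Unwritten = 653, Written = 2432),
  "7,8a,8b" = c(Unwritten = 257, Written = 281),
  "10,9"    = c(Unwritten = 48,  Written = 22)
)

# Row-wise percentages: each row sums to 100 (up to rounding).
writing_pct <- round(100 * prop.table(writing_counts, margin = 1), 1)
print(writing_pct)
```

Working with row percentages rather than raw counts keeps the clusters comparable even though their sizes differ widely (70 languages in the smallest cluster against 3085 in the largest).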
require(dplyr)
require(quanteda)
require(corrplot)
lang.prop<-read.csv("../input/ethnologue.csv", na.strings = "")
lang.prop.complete<-filter(lang.prop, complete.cases(language_use, classification))
lang.prop.complete<-lang.prop.complete %>% mutate(stat_num=gsub("^([0-9a-z]+) .*", "\\1", language_status))
lang.use.cat<-do.call("rbind", sapply(unique(lang.prop.complete$stat_num), function(s) {
c<-which(lang.prop.complete$stat_num == s)
data.frame(cat=s, language_use = paste(lang.prop.complete$language_use[c], collapse = " "),
stringsAsFactors = F)
}, simplify = F))
lang.use.corpus<-corpus(lang.use.cat$language_use)
docnames(lang.use.corpus) <- lang.use.cat$cat
lang.use.dfm <- dfm(lang.use.corpus,
remove_numbers=T,
remove = stopwords(), stem = F, remove_punct = TRUE)
require(knitr)
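# Drop rows with a missing writing field; logical indexing stays correct even when no value is missing
# (negative indexing with an empty which() result would drop every row)
lang.prop.complete.writing<-lang.prop.complete[!is.na(lang.prop.complete$writing),]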
lang.prop.complete.writing<-lang.prop.complete.writing %>% mutate(writing.script=gsub("(^[a-zA-Z]+) .*", "\\1", writing)) %>% mutate(writing.script=ifelse(writing.script!="Unwritten","Written", writing.script))
data<-sapply(seq(1,max(group)), function(x) {
sel<-which(lang.prop.complete.writing$stat_num %in% names(group[group==x]))
table(lang.prop.complete.writing[sel, "writing.script"])
}, simplify = F)
names(data) <- sapply(seq(1,max(group)), function(x) {paste(sort(names(group[which(group==x)])), collapse=",")})
data<-do.call("rbind", data[c(4,1,2,3)])
r<-row.names(data)
data.ele<-mutate (as.data.frame(data), Total=rowSums(data))
row.names(data.ele) <-r
plotPie<-function() {
require(plotly)
colortone<-list(
colors = c(rgb(253,174,97, maxColorValue=255), rgb(43,131,186, maxColorValue=255)))
subTitleStyle<-list(
font = list(family = "Courier New, monospace", size = 16, color = "black"),
xref = "paper",
yref = "paper",
yanchor = "bottom",
xanchor = "center",
align = "left",
showarrow = FALSE)
plot_ly() %>%
add_pie(data = data.frame(n=data[1,], doe=names(data[1,])), labels = ~doe, values = ~n,
textposition="outside", direction="clockwise", sort=F, rotation=0,
marker=colortone,
name = "Status: 1,2,3,4", domain = list(x = c(0, 0.4), y = c(0.6, 1))) %>%
layout(annotations=append(list(text="Status: 1,2,3,4", x=0.1, y=0.5), subTitleStyle)) %>%
add_pie(data = data.frame(n=data[2,], doe=names(data[2,])), labels = ~doe, values = ~n,
textposition="outside", direction="clockwise", sort=F, rotation=0,
name = "Status: 5,6a,6b", domain = list(x = c(0.6, 1), y = c(0.6, 1))) %>%
layout(annotations=append(list(text="Status: 5,6a,6b", x=1, y=0.5), subTitleStyle)) %>%
add_pie(data = data.frame(n=data[3,], doe=names(data[3,])), labels = ~doe, values = ~n,
textposition="outside", direction="clockwise", sort=F, rotation=180,
name = "Status: 7,8a,8b", domain = list(x = c(0, 0.4), y = c(0.4, 0))) %>%
layout(annotations=append(list(text="Status: 7,8a,8b", x=0.1, y=-0.1), subTitleStyle)) %>%
add_pie(data = data.frame(n=data[4,], doe=names(data[4,])), labels = ~doe, values = ~n,
textposition="outside", direction="clockwise", sort=F, rotation=180,
name = "Status: 9,10", domain = list(x = c(0.6, 1), y = c(0.4, 0))) %>%
layout(annotations=append(list(text="Status: 9,10", x=1, y=-0.1), subTitleStyle)) %>%
layout(title = 'Written languages vs unwritten languages', showlegend = T,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = F),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = F))
}